Entropy-growth-based model of emotionally charged online dialogues 
We analyze massive, emotionally annotated data from IRC (Internet Relay Chat) and model
the dialogues between its participants by assuming that the driving force of a discussion is the
entropy growth of the emotional probability distribution. This process is claimed to be correlated with
the emergence of the power-law distribution of the discussion lengths observed in the dialogues. We
perform numerical simulations based on the observed phenomenon and obtain good agreement with
the real data. Finally, we propose a method to artificially prolong the duration of a discussion
that relies on the entropy of the emotional probability distribution.

PACS numbers: 89.20.Hh, 89.75.Hc, 89.75.Da
I. INTRODUCTION

The extensive records of data have opened new possibilities
of examining communication between humans, ranging
from face-to-face encounters, through mobile telephone
calls [?], surface mail [?] and short messages [?], to
typical Internet activities such as e-mail correspondence
[?], bulletin board system (BBS) dialogues [10], forum
postings [11] or Twitter microblogging [12].

Communication and its evolution is one of the key aspects
of modern life, which is to an overwhelming extent
governed by the circulation of information. At its most
fundamental, communication is based on a dialogue, i.e.,
an exchange of information and ideas between two
people. In an ideal situation, if the highest priority
were given to acquiring certain information, then
from a layman's point of view the dialogue should be free
from any additional components that could restrain the
conversation's participants from achieving the common goal.
In reality, it is extremely difficult to model the complexity
of a dialogue: among other things, one would need to consider
the dialogue's semantic, pragmatic, social and emotional
context, sequences of turn-taking [3, ?], let alone its
attentive [16] or contextual [17] layers.

Compared with off-line communication, the exchange
of information on the Internet is claimed to be
more biased toward the emotional aspect [18]. This can
be explained by the online disinhibition effect [19], i.e., the
sense of anonymity that almost all Internet users have
when submitting their opinions on various fora or
blogs. Nevertheless, it is the Internet itself that gives the
opportunity to acquire massive data, thus making it possible
to perform a credible statistical analysis of common
habits in communication. As recent research shows,
it is already possible to spot certain phenomena among
Internet discussion participants by looking only at the
emotional content of their posts [20]-[24]. One of them is
collective emotional behavior [23]; another is a clear
correlation between the length of a discussion and its
emotional content [?].

In this paper we argue that a simple physical approach
based on observing the entropy of the emotional probability
distribution during a conversation can serve as
an indicator that a discussion is about to finish. This process
is claimed to be correlated with the emergence of the
power-law distribution of the discussion length and serves
as the key idea for the numerical simulations of the dialogues.
The paper is organized as follows: Section II
gives a brief description of the data used as well as of
the emotional classification method; Section III presents
our observations regarding the discussion length distribution,
the equalization of the emotional probabilities and
the entropy growth; Section IV describes the
simulation rules, whose results are given in Section V.
Finally, Section VI describes a potential application of
the observed phenomenon.
FIG. 1: (Color online) An exemplary dialogue of L
comments. 
II. DATA DESCRIPTION

As a source of data for analysing online dialogues we
chose Internet Relay Chat (IRC) [25] logs. Some
of the major IRC channels are automatically
archived by the channel operators; the logs are often
accessible to the general public and include the records
FIG. 2: (Color online) Average emotional value $\langle e \rangle_i^L$ (panels a, c, e, g, i) and average emotional probabilities $\langle p(-) \rangle_i^L$ (squares),
$\langle p(0) \rangle_i^L$ (circles), $\langle p(+) \rangle_i^L$ (triangles) in the i-th timestep (panels b, d, f, h, j) for dialogues of specific length L = 10 (a and b),
L = 20 (c and d), L = 30 (e and f), L = 40 (g and h) and L = 50 (i and j).



of real-time, chat-like communication between numerous 
participants. The presented analysis is limited only to 



FIG. 3: (Color online) Difference between terminal and initial
entropy value $\Delta S$ versus the dialogue length L.



one of the channels, namely #ubuntu [26], in the period
from 1 January 2007 to 31 December 2009. In this work we
focused on dialogues that included only two participants.
The final output, after several levels of data processing
(for details see Appendix A), consists of N = 93329
dialogues with length L between $L_{min} = 11$ and
$L_{max} = 339$. Each dialogue can be represented as
a chain of messages (see Fig. 1) where all odd posts are
submitted by one user and all even posts by the other.

The emotional classifier program that was used to analyze
the emotional content of the discussions is based
on a machine-learning (ML) approach. The algorithm
functions in two phases: during the training phase, it is
provided with a set of documents classified by humans
for emotional content (positive, negative or objective),
from which it learns the characteristics of each category.
Then, during the application phase, the algorithm applies
the acquired sentiment classification knowledge to
new, unseen documents. In our analysis, we trained a
hierarchical Language Model [13, ?] on the Blogs06 collection
[29] and applied the trained model to the extracted
IRC dialogues during the application phase. The algorithm
is based on a two-tier solution, according to which
FIG. 4: (Color online) Entropy $S_i^{Sh}$ of the average emotional probability distribution $\langle p(e) \rangle_i^L$ (topmost row) and average
emotional probabilities $\langle p(-) \rangle_i^L$ (squares), $\langle p(0) \rangle_i^L$ (circles) and $\langle p(+) \rangle_i^L$ (triangles) in the i-th timestep for BBC Forum discussions
of specific length L = 10 (first column), L = 20 (second column), L = 30 (third column), L = 40 (fourth column) and L = 50
(fifth column).



a post is initially classified as objective or subjective and,
in the latter case, it is further classified in terms of its
polarity, i.e., positive or negative. Each level of classification
applies a binary Language Model [?, 30]. Posts
are therefore annotated with a single value e = -1, 0 or
1 to quantify their emotional content (to be more precise,
their valence [?]) as negative, neutral or positive,
respectively.



III. COMMON FEATURES 

The obtained dialogues have been divided into groups
of constant dialogue length L. For such data we follow
the evolution of the mean emotional value $\langle e \rangle_i^L$ and of the average
emotional probabilities $\langle p(e) \rangle_i^L$. In both cases the
symbol $\langle \cdot \rangle_i^L$ indicates taking all dialogues with a specific
length L and averaging over all comments with number i;
thus, for example, $\langle p(-) \rangle_i^L$ is the probability that at
position i in all dialogues of length L there is a negative
statement. The characteristic feature observed regardless
of the dialogue length is that $\langle e \rangle_i^L$ at the end of the
dialogue is higher than at the beginning (upper row in
Fig. 2). In fact, the growth is especially rapid close to
the very end of the dialogue.

The direct reason for such behavior is shown in the
bottom row of Fig. 2, which presents the evolution of
the average emotional probabilities $\langle p(-) \rangle_i^L$, $\langle p(0) \rangle_i^L$ and
$\langle p(+) \rangle_i^L$. The observations can be summarized in the
following way:

• the negative emotional probability $\langle p(-) \rangle_i^L$ remains
almost constant,
FIG. 5: (Color online) Conditional probability $p(e|ne)$ of a consecutive
emotional post of the same sign versus the size n. Full
triangles, squares and circles are data points (respectively:
negative, neutral and positive messages), empty symbols are
shuffled data, solid lines come from Eq. (3) and dotted lines
represent the relation $p(e|ne) = p(e)$.

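• $\langle p(+) \rangle_i^L$ increases and $\langle p(0) \rangle_i^L$ has the opposite tendency,

• $\langle p(+) \rangle_i^L$ and $\langle p(0) \rangle_i^L$ tend to equalize in the vicinity
of the dialogue end.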
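
The positional averages described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: dialogues are taken to be plain lists of valences in {-1, 0, 1}, and the function name `positional_averages` and the toy data are hypothetical, not part of the original analysis.

```python
def positional_averages(dialogues, length):
    """Return (<e>_i^L, <p(e)>_i^L) for all dialogues of the given length.

    dialogues: iterable of lists with valences -1, 0 or 1 (assumed format).
    The result lists are indexed by position i - 1 (Python is 0-based).
    """
    group = [d for d in dialogues if len(d) == length]
    if not group:
        raise ValueError("no dialogues of the requested length")
    n = len(group)
    mean_valence = []
    probabilities = []
    for i in range(length):
        column = [d[i] for d in group]  # i-th comment of every dialogue
        mean_valence.append(sum(column) / n)
        probabilities.append({e: column.count(e) / n for e in (-1, 0, 1)})
    return mean_valence, probabilities


# Toy data (illustrative only): three dialogues of length 3, one of length 2.
toy = [[0, 0, 1], [0, -1, 1], [0, 0, 0], [1, 0]]
mean_e, p = positional_averages(toy, 3)
```

Grouping by exact length before averaging mirrors the text: a position i is only comparable across dialogues that end at the same L.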

Another manifestation of the system's features can be
spotted by examining the level of the entropy S of the
emotional probabilities $\langle p(e) \rangle_i^L$. Entropy and other
information-theoretic quantities such as mutual information [33],
Kullback-Leibler divergence [34] or Jensen-Shannon
divergence [?] have already been used to quantify certain
aspects of human mobility [35], semantic resemblance or
flow between Wikipedia pages [36, 37], or correlations
between consecutive emotional posts [38]. Moreover, entropy
has also been used to show how coherent
structures arise in e-mail dialogues [?] and how to predict
conversation patterns in face-to-face meetings [?]. In
this paper, the entropy follows Shannon's definition
[?], i.e.,

$$ S_i^{Sh} = -\sum_{e=-1,0,1} \langle p(e) \rangle_i^L \ln \langle p(e) \rangle_i^L. \tag{1} $$

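Here, taking into account the fact that $\langle p(-) \rangle_i^L$ is nearly
constant in the course of the dialogue, we paid attention only
to $\langle p(+) \rangle_i^L$ and $\langle p(0) \rangle_i^L$; thus the observed entropy had the
form

$$ S_i = -\left[ \langle p(0) \rangle_i^L \ln \langle p(0) \rangle_i^L + \langle p(+) \rangle_i^L \ln \langle p(+) \rangle_i^L \right]. \tag{2} $$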

Plotting the difference between the terminal and initial
entropy $\Delta S$ versus the length of the dialogue L, one can
see that for dialogues up to $L \approx 50$ this difference
is always above zero (see Fig. 3). This implies the following
likely scenario for the dialogue: it evolves in the direction
of growing entropy. At the beginning of the dialogue,
the probabilities $\langle p(0) \rangle_i^L$ and $\langle p(+) \rangle_i^L$ are separated from
each other, contributing to a low value of the initial entropy
$S_1$. Then the entropy grows: the probabilities
$\langle p(0) \rangle_i^L$ and $\langle p(+) \rangle_i^L$ equalize, leading to a high value of the
entropy (i.e., higher than the initial one) at the end of the
dialogue.
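
As a worked illustration of Eq. (2) and of the difference $\Delta S$ between the terminal and initial entropy, consider the minimal Python sketch below. The probability values are made up for the example (they are not taken from the IRC data), and the function names are hypothetical.

```python
import math


def reduced_entropy(p_neutral, p_positive):
    """Entropy of Eq. (2): only the neutral and positive components enter."""
    total = 0.0
    for p in (p_neutral, p_positive):
        if p > 0:  # the limit of p * ln(p) for p -> 0 is 0
            total -= p * math.log(p)
    return total


def entropy_change(probabilities):
    """Delta S = terminal entropy minus initial entropy.

    probabilities: list of dicts {-1: p, 0: p, 1: p}, one per position i.
    """
    first, last = probabilities[0], probabilities[-1]
    return (reduced_entropy(last[0], last[1])
            - reduced_entropy(first[0], first[1]))


# Illustrative values only: neutral dominates at the start,
# neutral and positive are equal at the end.
profile = [
    {-1: 0.2, 0: 0.7, 1: 0.1},
    {-1: 0.2, 0: 0.55, 1: 0.25},
    {-1: 0.2, 0: 0.4, 1: 0.4},
]
delta_s = entropy_change(profile)
```

With these toy numbers the separated initial probabilities give the lower entropy and the equalized terminal ones the higher entropy, so `delta_s` is positive, which is the scenario described in the text.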

However, it is essential to note that the behavior observed
in the IRC data is only one of the possible scenarios
of a more general phenomenon, the principle of
maximum entropy [39], which also governs certain aspects of
biological or social systems [41] (at the level of social
networks). To be more precise, we performed an analysis
analogous to that for the IRC data on an
emotionally annotated dataset from the BBC Forum (see
[23] and [24]) consisting of over $2 \times 10^6$ comments and
almost $10^5$ discussions. In this case each discussion was
treated as a natural "dialogue", although it usually involved
more than two users communicating with each other.
Following the line of thought presented for the IRC data, we
grouped all discussions of constant length and calculated
the quantities $\langle p(-) \rangle_i^L$, $\langle p(0) \rangle_i^L$, $\langle p(+) \rangle_i^L$ and $S_i^{Sh}$. The
results, shown in Fig. 4, bear close resemblance to those
obtained for the IRC data: one can clearly see that while the
negative component decreases, the positive and (partially)
the objective ones increase. This has an immediate effect on the
value of the entropy, which grows during the evolution of the
discussion (topmost row in Fig. 4). The main difference
between the IRC and BBC Forum results concerns the
component whose value decreases during the evolution of the
discussion: for IRC it is $\langle p(0) \rangle_i^L$, while for the BBC Forum
it is $\langle p(-) \rangle_i^L$. This is directly connected to the fact that the
above-mentioned components play the role of "discussion
fuel" [23] propelling the thread's evolution. The BBC Forum
data come from such categories as "World News" and
"UK News" and as such may lead the discussion participants
to post comments of very negative valence. On
the other hand, the #ubuntu IRC channel serves rather as a
source of professional help, which is normally expressed
in terms of neutral dialogue. As the discussion lasts, the
topic dilutes (BBC Forum) or the problem is solved
(IRC), and the dominating component dies out, leading to
the maximization of entropy.

There is also another process taking place in the system
in question that displays non-trivial behavior. As
shown previously in [23], we can talk about the grouping of
similarly emotional messages. To quantify the persistence
of a specific emotion one can consider the conditional
probability $p(e|ne)$ that after n comments with
the same emotional valence the next comment has the
same sign. As is easy to prove, if e were an independent
and identically distributed (i.i.d.) variable, the conditional
probability $p(e|ne)$ would be independent of n and
equal to p(e), i.e., the probability of a specific emotion
in the whole dataset (see Table I). In the case of the IRC
data, the analysis shows (see Fig. 5) that $p(e|ne)$ is well
FIG. 6: (Color online) (a) Probabilities of specific valence $p_i^M(-)$ (triangles), $p_i^M(0)$ (squares) and $p_i^M(+)$ (circles) in the i-th
time window given by Eq. (4) for the exemplary dialogue shown in panel (c). (b) Entropy $S_i$ in the i-th time window defined
by Eq. (5) for the exemplary dialogue shown in panel (c). The dotted line marks the maximal value of the entropy in Eq. (5), i.e.,
$S^{max} = 0.8 \ln 2.5 \approx 0.73$.



approximated by

$$ p(e|ne) = p(e|e)\, n^{\alpha_e}, \tag{3} $$

where $p(e|e)$ is the conditional probability that two consecutive
messages have the same emotion. The discrepancy
between the data and the relation obtained by random
insertion of emotional comments (see open symbols
in Fig. 5) is significant. The exponents $\alpha_e$ and the conditional
probabilities $p(e|e)$ are gathered in Table I.
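
One possible way to estimate the conditional probability $p(e|ne)$ from annotated dialogues is sketched below in Python. It assumes that a cluster of size n is counted whenever the current run of equal valences has length exactly n and is followed by at least one more comment; this counting convention, the function name and the toy dialogue are assumptions made for illustration, not a description of the original procedure.

```python
from collections import Counter


def cluster_growth_probability(dialogues, valence, max_n):
    """Estimate p(e|ne) for n = 1..max_n for the given valence e.

    dialogues: iterable of lists with valences -1, 0 or 1 (assumed format).
    Returns a dict {n: estimate}; sizes n that never occur are left out.
    """
    followed = Counter()   # runs of length n followed by any comment
    continued = Counter()  # ... where that comment has the same valence
    for dialogue in dialogues:
        run = 0
        for current, following in zip(dialogue, dialogue[1:]):
            run = run + 1 if current == valence else 0
            if run:
                followed[run] += 1
                if following == valence:
                    continued[run] += 1
    return {n: continued[n] / followed[n]
            for n in range(1, max_n + 1) if followed[n]}


# Toy dialogue (illustrative only).
estimate = cluster_growth_probability([[1, 1, 0, 1, 1, 1]], valence=1, max_n=3)
```

For an i.i.d. sequence such an estimate should fluctuate around p(e) for every n, whereas Eq. (3) predicts growth with n when $\alpha_e > 0$.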
0.154


0.19 
TABLE I: Fundamental properties of dialogue data: probabilities
of specific emotion p(e), conditional probabilities $p(e|e)$
and scaling exponents $\alpha_e$ for the power-law cluster growth.



IV. SIMULATION DESCRIPTION 

The methodology described above proves successful
in finding the prominent characteristics of the data in
question; however, it is of little use if one would like to
perform simulations of the dialogues. It is crucial to
choose another way of calculating the average emotional
probabilities "on the fly" and, using the results, to decide
on the further evolution of the dialogue. Thus, we decided to
work with a moving time window, i.e., the probabilities of
the specific valences in the i-th timestep are

$$ p_i^M(e) = \frac{1}{M} \sum_{j=0}^{M-1} \delta_{e(i-j),\,e}, \tag{4} $$



for $i \geq M$, where $\delta$ is the Kronecker delta symbol and
M is the size of the window. Consequently, the entropy $S_i$ is
also calculated using the probabilities $p_i^M(+)$ and $p_i^M(0)$
as

$$ S_i = -\left[ p_i^M(0) \ln p_i^M(0) + p_i^M(+) \ln p_i^M(+) \right], \tag{5} $$

expressing in fact the entropy in the i-th time window.
The practical way of application is shown in Fig. 6 for
a dialogue of L = 30 comments. In this case the size of
the time window is set to M = 10.
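
A minimal Python sketch of the moving-window quantities may help to fix the indexing. It assumes 1-based positions i, as in the text, and a window covering comments i - M + 1, ..., i; the function names and the example sequence are hypothetical.

```python
import math


def window_probabilities(valences, i, M=10):
    """Share of each valence among the M comments ending at position i.

    valences: list of -1, 0 or 1; i is 1-based and must satisfy M <= i <= len.
    """
    if i < M or i > len(valences):
        raise ValueError("position i must satisfy M <= i <= len(valences)")
    window = valences[i - M:i]  # comments i-M+1 .. i in 1-based numbering
    return {e: window.count(e) / M for e in (-1, 0, 1)}


def window_entropy(valences, i, M=10):
    """Entropy in the i-th window, using only neutral and positive terms."""
    p = window_probabilities(valences, i, M)
    return -sum(p[e] * math.log(p[e]) for e in (0, 1) if p[e] > 0)


# Illustrative sequence: six neutral comments followed by four positive ones.
sequence = [0] * 6 + [1] * 4
p_window = window_probabilities(sequence, 10)
s_window = window_entropy(sequence, 10)
```

Because the window has a fixed size M, the probabilities can only take values that are multiples of 1/M, which is why the entropy in a single dialogue changes in discrete steps.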

The data-driven facts presented in the previous section
form the basis of the simulation of dialogues in the IRC
channel data. The key point, treated as an input
for this model, is the observed preferential attraction
of consecutive emotional messages. This mechanism "runs"
the dialogue, whereas the discussion is terminated once
the difference between the entropy at the given moment
and its initial value exceeds a certain threshold. Those features
are implemented in the following algorithm:

(1) start the dialogue by drawing the first emotional
comment with probability p(e),

(2) set the next comment to have the emotional valence e
of the previous comment with probability $p(e|ne) =
p(e|e)\, n^{\alpha_e}$, where n is the current number of
consecutive comments with valence e,

(3) if the drawn random number is higher than $p(e|ne)$, set
the next comment to one of the two other emotional values
(i.e., if the original e = 1, then the next comment
valence is 0 with probability p(0)/[p(0) + p(-)]
or -1 with probability p(-)/[p(0) + p(-)]),

(4) if the difference between the entropy in this time-step
and the initial entropy is higher than the threshold level
$\Delta S = 0.05$, terminate the simulation; otherwise go
to point (2).
FIG. 7: (Color online) Comparison of the average emotional value $\langle e \rangle_i^{L=50}$ (panel a) and the probability of specific emotion (panel b,
$\langle p(-) \rangle_i^{L=50}$ - squares, $\langle p(0) \rangle_i^{L=50}$ - circles, $\langle p(+) \rangle_i^{L=50}$ - triangles) for simulations performed according to the procedure presented
in Sec. IV (full symbols) and for real data (empty symbols) for dialogue length L = 50. The real data shown are identical with
those shown in Fig. 2(i) and Fig. 2(j).
FIG. 8: (Color online) (a) Dialogue length distribution H(L) for real data (empty circles) and simulations for different values of
the initial entropy threshold parameter $S_T$: $S_T = 0.1$ (empty squares), $S_T = 0.5$ (empty triangles), $S_T = 0.6$ (empty diamonds)
and $S_T = 0.67$ (filled circles). (b) Dialogue length distribution H(L) for: real data (empty circles), simulations with $S_T = 0.67$
(filled circles) and simulations with $S_T = 0.67$ and insertion of additional neutral comments (empty triangles). The solid
line is for visual guidance.



The observed valence probabilities in this simulation are
always calculated using quantities in a moving time window
given by Eqs. (4) and (5) with M = 10.

There is another crucial parameter connected to the
simulation process, i.e., the initial entropy threshold $S_T$.
When time-step i = M is reached, the entropy $S_i$ is calculated
for the first time and a decision is taken: if
$S_M < S_T$ the simulation runs further, otherwise it is
cancelled and repeated. The total number of successfully
simulated dialogues is equal to that observed in the real
data.
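
The algorithm of this section can be sketched in Python as follows. Several points are assumptions made only for illustration: the numerical values of p(e), $p(e|e)$ and $\alpha_e$ below are placeholders (they are NOT the values of Table I), the "initial entropy" is taken to be $S_M$, the growth probability is capped at 1, and a maximal length of 339 comments is imposed to guarantee termination.

```python
import math
import random

# Placeholder parameters for illustration only (not the Table I values).
P = {-1: 0.15, 0: 0.55, 1: 0.30}       # p(e)
P_SAME = {-1: 0.2, 0: 0.6, 1: 0.4}     # p(e|e)
ALPHA = {-1: 0.2, 0: 0.05, 1: 0.1}     # alpha_e


def window_entropy(valences, M):
    """Entropy of neutral and positive shares in the last M comments."""
    window = valences[-M:]
    total = 0.0
    for e in (0, 1):
        p = window.count(e) / M
        if p > 0:
            total -= p * math.log(p)
    return total


def next_valence(previous, run, rng):
    """Steps (2)-(3): keep the valence with p(e|e) * n^alpha, else switch."""
    p_stay = min(1.0, P_SAME[previous] * run ** ALPHA[previous])
    if rng.random() < p_stay:
        return previous
    others = [e for e in (-1, 0, 1) if e != previous]
    return rng.choices(others, weights=[P[e] for e in others])[0]


def grow(valences, run, rng):
    """Append one comment and return the updated run length."""
    e = next_valence(valences[-1], run, rng)
    run = run + 1 if e == valences[-1] else 1
    valences.append(e)
    return run


def simulate_dialogue(rng, M=10, s_t=0.67, delta_s=0.05, max_len=339):
    while True:  # repeat until the first window passes the S_M < S_T check
        valences = [rng.choices([-1, 0, 1], weights=[P[-1], P[0], P[1]])[0]]
        run = 1
        while len(valences) < M:
            run = grow(valences, run, rng)
        s_initial = window_entropy(valences, M)
        if s_initial < s_t:
            break
    while len(valences) < max_len:
        run = grow(valences, run, rng)
        if window_entropy(valences, M) - s_initial > delta_s:  # step (4)
            break
    return valences


dialogue = simulate_dialogue(random.Random(0))
```

In this sketch the termination check is first applied after comment M + 1, so the shortest possible simulated dialogue has 11 comments for M = 10. Repeating `simulate_dialogue` many times and histogramming the lengths would give a counterpart of the distribution H(L), under the placeholder parameters above.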



V. SIMULATION RESULTS 

Figure 7 shows a comparison of the average emotional
value $\langle e \rangle_i^L$ and the average emotional probabilities $\langle p(e) \rangle_i^L$ for
the real data and for simulations performed according to the
algorithm described in the previous section, for dialogues
of length L = 50. As one can see, the plots bear close
resemblance apart from one detail, i.e., the rising
value of $\langle p(-) \rangle_i^L$ close to the end of the dialogue.

Moreover, the simulation depends strongly on the exact
value of the initial entropy threshold $S_T$, which can
be clearly seen in Fig. 8(a), where the dialogue length
distribution is presented. If $S_T$ is restricted to values
between 0.1 and 0.5 (empty squares and triangles), the
distribution of dialogue lengths is exponential and does not
follow the one observed in the real data (empty circles).
Higher values ($S_T = 0.6$, empty diamonds) shift the
curve closer to the data points; nevertheless, its character
is still exponential. It is only after tuning the $S_T$ parameter
to 0.67 that the results obtained from the simulations
(full circles) are qualitatively comparable with the real
data.



VI. APPLICATION 

It is possible to consider a direct application of the
model described above for changing the "trajectory" of
a dialogue. For example, let us assume that a dialogue
system [42]-[44] takes part in the conversation and
that its task is to prolong the discussion. In such a situation,
a system relying on the properties presented above
would attempt to detect any signs indicating
that the dialogue might come to an end and react against
it. According to the observations presented in Section III, a
marker of such an event should be the growth of the entropy.
In other words, the dialogue system should prevent
an increase of the entropy in consecutive time-steps.

In the described case, such an action would be equivalent
to the insertion of an objective comment. In this
way, the equalization of $p_i^M(+)$ and $p_i^M(0)$ is prevented
and the dialogue can last longer. An implementation
of this rule is presented in Fig. 8(b), where one can
compare the real data (again empty circles), a simulation
including the entropy-growth rule (again full circles)
and a simulation with the insertion of objective comments
(empty triangles). While there is a drop in
the numbers for small dialogue lengths, the vast majority
of the dialogues have the maximal length (the point
in the top-right corner). In this way the insertion of
objective comments is in line with the expected idea of
dialogue prolonging.
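
A very simple version of such an intervention rule is sketched below in Python. The text does not specify exactly when the objective comment is inserted, so the rule used here (append one neutral comment whenever the window entropy has just increased between two consecutive time-steps) is an assumption, as are the function names and the toy history.

```python
import math


def window_entropy(valences, M=10):
    """Entropy of neutral and positive shares in the last M comments."""
    window = valences[-M:]
    total = 0.0
    for e in (0, 1):
        p = window.count(e) / M
        if p > 0:
            total -= p * math.log(p)
    return total


def prolong(valences, M=10):
    """Return a copy of the history, with a neutral comment appended
    if the window entropy grew in the last time-step (assumed rule)."""
    history = list(valences)
    if len(history) <= M:
        return history  # two full windows are needed for a comparison
    previous = window_entropy(history[:-1], M)
    current = window_entropy(history, M)
    if current > previous:
        history.append(0)  # insert an objective (neutral) comment
    return history


# Toy history (illustrative only): a second positive comment raises entropy.
rising = prolong([0] * 9 + [1, 1])
flat = prolong([0] * 11)
```

In a simulation such a function would be called after every newly drawn comment; whether the inserted comment replaces or follows the drawn one is a design choice left open here.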



VII. CONCLUSIONS 

The analysis performed on the emotionally annotated dialogues
extracted from IRC data demonstrates that following
such simple metrics as the probability of a specific emotion
can be useful for predicting the future evolution of a discussion.
Moreover, all the analysed dialogues share the same
property, i.e., the tendency to evolve in the direction of
growing entropy. Those features, combined
with the observations regarding the preferential growth
of clusters, are sufficient to reproduce the real data with
a rather straightforward simulation model. In the paper,
we also proposed a procedure for directly applying the
observed rules in order to modify the way the dialogue
evolves. It appears that inserting comments with the emotion
that initially dominated and then started to vanish
prolongs the discussion by lowering the entropy value.
Those observations may be of help in designing the next
generation of interactive software tools [45]-[47] intended
to support e-communities by measuring various features
of their interaction patterns, including their emotional
state at the individual, group and collective levels.
Appendix A: Dialogue extraction method

In total, we used 994 daily files with 4600 to 18000
utterances that share the format presented in the first column
from the left in Table II: post_number [timestamp]
(user_id) sentiment_class, with the sentiment class e in
{-1, 0, 1} used as a marker of the emotional valence
throughout this study. Moreover, we could also use information
that specifies which user communicates with, i.e., directly
addresses, another user (see the second column in Table II,
shown as (addressing_user_id) -> (addressed_user_id)).
The discovery of the direct communication links between
two users in the IRC channel was based on finding
another user ID at the beginning of an utterance,
followed by a comma or a semicolon, a scheme
commonly used in multi-user communication
channels. However, one has to bear in mind that this
kind of information can sometimes be incomplete, i.e., in
many cases users do not explicitly specify the receiver of
their post. Another issue is that the data
consist of several overlapping dialogues held simultaneously
on one channel. It is also sometimes difficult to
identify the receiver of a message, as only part of the messages
are annotated with the id of the user they are directed to. We
created an algorithm that addresses this issue. It consists
of two different approaches:

(a) if user A addresses user B at some moment in time
and later A writes consecutive messages without
addressing anybody specific, we assume that he/she
is still having a conversation with B,

(b) if user A addresses user B and then B writes a
message without addressing anybody specific, we assume
that he/she is answering A.
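
Rules (a) and (b) can be sketched in Python as shown below. The sketch assumes that messages are given as (minute, user, addressee) tuples sorted by time, that rule (a) takes precedence over rule (b) when both apply, that inferred links are remembered in the same way as explicit ones, and that the search is limited to a time window t measured in minutes (5 minutes in this study); the function name and the toy log are hypothetical.

```python
def resolve_addressees(messages, t=5):
    """Fill in missing addressees using rules (a) and (b).

    messages: list of (minute, user, addressee_or_None), sorted by minute.
    Returns a list of the same tuples with addressees filled in where a
    rule applies within t minutes; otherwise the addressee stays None.
    """
    last_target = {}  # user -> (minute, addressee) of the user's last link
    last_sender = {}  # user -> (minute, sender) of the last link to the user
    resolved = []
    for minute, user, addressee in messages:
        if addressee is None:
            own = last_target.get(user)
            incoming = last_sender.get(user)
            if own and minute - own[0] <= t:            # rule (a)
                addressee = own[1]
            elif incoming and minute - incoming[0] <= t:  # rule (b)
                addressee = incoming[1]
        if addressee is not None:
            last_target[user] = (minute, addressee)
            last_sender[addressee] = (minute, user)
        resolved.append((minute, user, addressee))
    return resolved


# Toy log (illustrative only).
log = [(0, "A", "B"), (1, "A", None), (2, "B", None), (20, "A", None)]
links = resolve_addressees(log)
```

The last message of the toy log stays unresolved because more than t minutes have passed since any link involving user A.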

The main parameter of this algorithm is the time t within
which the search is performed; in our study we use
t = 5 minutes as the threshold value. An exemplary
output of the algorithm is shown in the third column
of Table II. In this way we are able to extract a set of
dialogues from each of the daily files. After processing
a file according to the rules described above, another issue
emerges: it often happens that a user sends a set of consecutive
messages directed to one receiver (e.g., the 8th,
10th and 11th lines in the third column of Table II). To
TABLE II: The process of dialogue extraction in the IRC channel data. Columns from the left show consecutive steps of the
algorithm: the first and second show the raw data, the third is the data after application of the searching procedure, the fourth is the data after
averaging multiple posts from the same user and the fifth column gives the final output. [hh:mm] defines the timestamp in hours
(hh) and minutes (mm), (user_id) gives the id of the user that submits the post, (addressing_user_id) -> (addressed_user_id)
gives the ids of both the addressing and addressed users, and the value in {-1, 0, 1} shows the valence of the post.



create a standardized version of the dialogue (A to B, B
to A, A to B and so on), we decided to accumulate the
consecutive emotional messages of the same user, calculate
the average value $\bar{e}$ in such a series and then transform
it back into a three-state value $e'$ according to the formula

$$ e' = \begin{cases} -1, & \bar{e} \in [-1; -\tfrac{1}{3}], \\ 0, & \bar{e} \in (-\tfrac{1}{3}; \tfrac{1}{3}), \\ 1, & \bar{e} \in [\tfrac{1}{3}; 1]. \end{cases} \tag{A1} $$
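
The merging of consecutive posts and the mapping back to three states can be sketched in Python as follows. The threshold of 1/3 is an assumption about how the interval boundaries in Eq. (A1) should be read, and the function names and toy posts are hypothetical.

```python
from itertools import groupby


def three_state(mean_valence, threshold=1 / 3):
    """Map an average valence in [-1, 1] back to -1, 0 or 1.

    The threshold value is an assumed reading of Eq. (A1).
    """
    if mean_valence <= -threshold:
        return -1
    if mean_valence >= threshold:
        return 1
    return 0


def merge_turns(posts):
    """Collapse consecutive posts of the same user into a single turn.

    posts: list of (user, valence) in chronological order for one dialogue.
    """
    merged = []
    for user, run in groupby(posts, key=lambda post: post[0]):
        values = [valence for _, valence in run]
        merged.append((user, three_state(sum(values) / len(values))))
    return merged


# Toy posts (illustrative only).
posts = [("A", 1), ("A", 0), ("A", 1), ("B", -1),
         ("A", 0), ("A", 1), ("A", 0), ("A", 0)]
turns = merge_turns(posts)
```

After merging, the users strictly alternate, which gives the standardized A to B, B to A form required for the analysis of the main text.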



In effect we obtain the set shown in the fourth column of
Table II. The final step of the data preparation is to divide
it into separate dialogues, as shown in the fifth column
of Table II. In total, the algorithm produces N = 93329
dialogues with lengths between L = 11 and L = 339
(all the dialogues with L < 10 were omitted).